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Abstract 

Bisulfite sequencing (BS-seq) is the gold standard for studying genome-wide DNA methylation. We developed 
MOABS to increase the speed, accuracy, statistical power and biological relevance of BS-seq data analysis. MOABS 
detects differential methylation with 10-fold coverage at single-CpG resolution based on a Beta-Binomial hierarchical 
model and is capable of processing two billion reads in 24 CPU hours. Here, using simulated and real BS-seq data, 
we demonstrate that MOABS outperforms other leading algorithms, such as Fisher's exact test and BSmooth. 
Furthermore, MOABS analysis can be easily extended to differential 5hmC analysis using RRBS and oxBS-seq. 
MOABS is available at http://code.google.eom/p/moabs/. 



Background 

DNA methylation, an epigenetic modification affecting 
organization and function of the genome, plays a critical 
role in both normal development and disease. Until re- 
cently, the only known DNA methylation was 5- 
methylcytosine (5mC) at CpG dinucleotides, which is 
generally associated with transcriptional repression [1]. 
In 2009, another form of DNA methylation termed 5- 
hydroxymethylcytosine (5hmC) [2] was found to be in- 
volved in active demethylation [3] and gene regulation 
[4]. Understanding the functional role of DNA methyla- 
tion requires knowledge of its distribution in the gen- 
ome [5,6]. Bisulfite conversion of unmethylated Cs to Ts 
followed by deep sequencing (BS-Seq) has emerged as 
the gold standard to study genome-wide DNA methyla- 
tion at single-nucleotide resolution. The most popular 
protocols include RRBS (Reduced Representation Bisul- 
fite Sequencing) [7] and WGBS (Whole Genome Bisul- 
fite Sequencing) [8] for the combination of 5mc and 
5hmc, oxBS-Seq (Oxidative Bisulfite Sequencing) [9] for 
5mc and TAB-Seq (Tet-assisted Bisulfite Sequencing) [10] 
for 5hmc, respectively. After mapping BS-seq reads to the 
genome, the proportion of unchanged Cs is regarded as 
the absolute DNA methylation level. Due to random sam- 
pling nature of BS-seq, deep sequencing (e.g. >30 fold) is 
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usually required to reduce the measurement error. Techno- 
logical advances and reduced costs have seen a significant 
increase in interest in BS-seq among biologists. Currently, 
BS-seq is widely used by small laboratories to profile cell 
lines and animal models [11], as well as by large consor- 
tiums such as the NIH ENCODE, Roadmap Epigenomics, 
The Cancer Genome Atlas (TCGA), and European BLUE- 
PRINT to profile thousands of cell populations. Hence, it is 
expected that BS-seq data will continue to grow exponen- 
tially. However, despite recent progress [7,12-14], computa- 
tional methods designed for issues specific to BS-seq are 
much less developed than those for other sequencing appli- 
cations such as ChlP-Seq and RNA-seq. 

The most fundamental aspects of BS-seq data analysis 
include read mapping and differential methylation detec- 
tion. We previously developed one of the most widely 
used BS mapping programmed BSMAP [15]. After read 
mapping, the most common task is the identification of 
differentially methylated regions (DMRs) between sam- 
ples, such as disease versus normal. Based on the bio- 
logical question, DMRs can range in size from a single 
CpG (DMC: differentially methylated CpG) to tens of mil- 
lions of bases. Although several statistical methods have 
been applied to DMR detection [12], among which Fishers 
exact test p-value (FETP) method [16] is the most popular, 
several challenges remain to be addressed. 1) Statistical 
Power: most previous methods are very conservative in 
power and require deep sequencing (e.g. 30 fold). For 
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example, Hansen [13] recently calculated that for single 
CpG methylation level "even 30x coverage yields standard 
error as large as 0.09". As a compromise, many studies as- 
sumed that neighboring CpGs have similar methylation 
levels, thus can be combined together within a genomic 
region (e.g. 1 kb) to increase the statistical power [17]. For 
example, BSmooth [13] performs local smoothing fol- 
lowed by t-test for DMR detection. While this strategy 
may be applicable in many cases, regional average analysis 
will unfortunately miss low- CpG -density DMRs that are 
abundant in the genome and critical for gene expression, 
such as TFBSs. Most TFBSs are small (i.e. < 50 bp) as im- 
plied by high-resolution ChlP-seq and ChlP-exo experi- 
ments [18] and contain few or even a single CpG(s) that 
are in general differentially methylated compared to sur- 
rounding ones, thus are very likely to be "overlooked" by 
the regional average analysis. 2) Biological Significance: 
previous methods use p-value for statistical significance of 
DMR. This p-value metric only tells whether a region is 
differentially methylated, but does not directly measure 
the magnitude of the methylation difference. A similar 
problem also exists in gene expression profiling, where the 
p-value does not directly measure the expression fold- 
change [19]. Since sequencing depth in BS-seq experi- 
ments can fluctuate by an order of magnitude in different 
loci, a very small methylation difference, although not bio- 
logically meaningful, can easily return a significant p-value 
if the underlying sequencing depth is deep enough. On 
the other hand, the nominal methylation difference, i.e. 
direct subtraction of two methylation ratios, suffers signifi- 
cantly from the random sampling error such that a large 
difference with low sequencing depth is not likely to be 
statistically meaningful. 3) Biological Variation is an essen- 
tial feature of DNA methylation [20], and should be han- 
dled carefully to detect reproducible DMRs that represent 
the common characteristics of the sample group. How- 
ever, most previous methods fail to account for biological 
variation between replicates, and simply pool the raw data 
from replicates for DMR detection. Some of the resulting 
DMRs may have significant differences at the mean level 
but might not be reproducible between replicates, and 
hence are "false-positives". To our knowledge, BSmooth 
[13] is the first replicate-aware program that accounted 
for biological variation using a modified t-test. 

In response to these challenges, we developed a power- 
ful differential methylation analysis algorithm termed 
MOABS: Model-based Analysis of Bisulfite Sequencing 
data. Its source code is available as Additional file 1. 
MOABS uses a Beta-Binomial hierarchical model to cap- 
ture both sampling and biological variations, and accord- 
ingly adjusts observed nominal methylation difference by 
sequencing depth and sample reproducibility. The result- 
ing credible methylation difference (CDIF) is a single 
metric that combines both biological and statistical 



significance of differential methylation. Using both simu- 
lated and real whole-genome BS-seq data from mouse 
brain tissues and stem cells, we demonstrate the superior 
performance of MOABS over other leading methods, es- 
pecially at low sequencing depth. Furthermore, one prac- 
tical challenge is that BS-seq data analysis is usually 
computational intensive, and requires multiple steps. We 
therefore seamlessly integrate several major BS-seq pro- 
cessing procedures into MOABS, including read mapping, 
methylation ratio calling, identification of hypo- or hyper- 
methylated regions from one sample, and differential 
methylation from multiple samples. MOABS is imple- 
mented in C++ with highly efficient numerical algorithms, 
and thus is at least 10 times faster than other popular 
packages. For example, it takes only 24 CPU hours to de- 
tect differential methylation from 2 billion aligned reads. 
Together, MOABS provides a comprehensive, accurate, ef- 
ficient and user-friendly solution for analyzing large-scale 
BS-seq data. 

Results and discussion 

Beta-Binomial hierarchical model for both sampling and 
biological variations 

For a single CpG locus in the ;-th biological replicate of 
condition z, we denote the number of total reads, the 
number of methylated reads and methylation ratio as n t p 
kij and pip respectively. For a typical two group compari- 
son, i = 1,2 and ; = 1, 2, N, where N is the number of 
replicates in each condition. The ny and k% are observa- 
tions from experiments, while the py is unknown with 
kij/riij as its nominal estimation. Given p^nd n t p the 
number of methylated reads kg is characterized by the 
sampling variation from sequencing and can be modeled 
by a Binomial distribution: kg ~ Binomial(nipPij). The 
posterior distribution of the methylation ratio p^ will 
then follow a Beta distribution Beta{<Xip^ij) and can be 
estimated using an Empirical Bayes approach. The prior 
distribution will be estimated from the whole genome, in 
which most CpGs are either fully methylated or fully un- 
methylated, resulting in a bimodal distribution. The Em- 
pirical Bayes approach will automatically incorporate 
such bimodal information in the methylation ratio esti- 
mation and hence increases the power of our analysis. 

When biological replicates are available, we will refine 
the posterior distribution of p t j with biological variation 
from the Bayesian perspective. Specifically, a t and p; will 
be treated as random variables with a prior distribution 
estimated from all the CpGs in the genome similar to 
the Empirical Bayes priors. We will then use maximum 
likelihood approach to generate the posterior distribu- 
tion of pi. Typical posterior distributions of four CpGs 
are shown in Figure la, in which all CpGs have the same 
average methylation ratios and the same total number of 
reads. Their methylation ratios would have identical Beta 
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Figure 1 Overview of the MOABS algorithm, (a) Posterior distribution of methylation ratio inferred from biological replicates. Each curve represents 
the inferred methylation ratio Beta distribution of a CpG. The symbols at the bottom indicate the observed methylation ratios of all replicates. The values 
on the top right corner indicate number of methylated reads over number of total reads in each replicate, (b) An example of Credible Methylation 
Difference (CDIF). Dash curves indicate inferred methylation ratio Beta distributions from low (Sample #1) or high sequencing depth (Sample #2). The black 
curve is the exact distribution of the methylation difference between two samples. The CDIF is shown as the lower bound of the 95% confidence interval, 
(c) Ranking of three CpG examples by CDIF, FEYP p-value and nominal difference, i.e. direct subtraction of two methylation ratios. The three curves are the 
exact distributions of methylation differences. The corresponding CDIF values are show as vertical dash lines. 



distributions (black curve on CpG #1) if biological variation 
was not considered. Our method is able to adjust the poster- 
ior distribution of p t based on observed biological variation. 
For example, highly variable replicates on CpG #2 results in 
a bimodal distribution, whereas reproducible replicates on 
CpG #3 leads to a normal-like distribution. Furthermore, in- 
creasing the number of reproducible replicates from 2 to 3 
on CpG #4 will reduce the variation of the resulting poster- 
ior distribution. Taken together, the posterior distribution of 
the methylation ratio in condition i will be determined by its 
prior distribution, sequencing depth, and the degree of vari- 
ation between replicates. 

Credible methylation difference (CDIF) is a single metric 
for both statistical and biological significance of 
differential methylation 

We illustrate the idea of CDIF using a simple experi- 
mental design, in which only one sample (N=l) is 



sequenced for each of the two conditions: ~ Binomial 
(n i} pi) and p l >~ Beta(a it f}i), i = 1, 2. The Empirical Bayes 
priors cf},ft} of p t will be estimated from all the CpGs in 
the genome by maximizing a marginal likelihood func- 
tion using the quasi-Newton optimization method [21]. 
In this case, there is no biological variation, so Beta{a b 
fii) will be only determined by the prior distribution and 
sequencing depth: at = ki + a® and /?. = nt-ki + ft}. An 
example is shown in Figure lb. Due to low sequencing 
depth (/q = 9; n x = 10), sample #l's Beta distribution has 
higher variance than that of sample #2 with high se- 
quencing depth (k 2 = 12;n 2 = 80). The methylation ratio 
difference between two samples is denoted as t = p 2 - p 2 . 
One immediate question is how to estimate the confi- 
dence interval CI(a,b) of t. Many methods have been 
proposed but their merits have been subject to debate 
[22]. We therefore propose to use the exact numer- 
ical solution [23] to solve CI(a,b). CDIF is then 
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defined as the distance between 0 and the 95% CI 
(a,b) (Figure lb): 

if a>0 
0, ifa<0<b 
b, if b<0 

In practice, CDIF represents the conservative estimation of 
the true methylation difference, i.e. for 97.5% of chance the 
absolute value of true methylation difference is greater than or 
equal to that of CDIF. The CDIF value will be assigned to 0 is 
there is no significant difference. Constructed in this way, the 
CDIF value, if greater than the resolution defined as min(l/w 1 , 
l/« 2 ), guarantees a significant p-value from Fishers exact test, 
and at the same time represents the magnitude of methylation 
difference. The sequencing depth will largely influence CDIF, 
since bigger n t will make a smaller 95% CI of the methylation 
difference, normally resulting in greater CDIF value. 

We believe CDIF is a better metric to capture the methy- 
lation difference than statistical p-value or nominal met- 
hylation difference. Three CpG examples are shown in 



Figure lc. According to p-value 1.4e-10, CpG #3 is the most 
significant one. However, this significant p-value, which is 
largely driven by the high sequencing depth, does not cor- 
rectly represent the actual biological difference of 0.3, which 
is the smallest among three CpGs. On the other hand, if we 
use nominal difference, CpG #2 would be the most signifi- 
cant. However, its low sequencing depth makes this high 
nominal difference unreliable. CDIF is able to penalize the 
nominal difference according to its statistical significance 
and ranks CpG #1 as the most significant followed by CpGs 
#2 and #3, although CpG #1 does not have the most signifi- 
cant p-value or nominal difference. Taken together, CDIF 
reaches a well balance between statistical and biological sig- 
nificance and gives a more stable and biological meaningful 
interpretation and ranking of differential methylation. 



Functions and performance of the MOABS pipeline 

We have implemented MOABS as a comprehensive soft- 
ware pipeline (Figure 2a), including read alignment, quality 



SampleA_rep1 


SampleA_rep2 




SampleB_rep1 


SampleB_rep2 


\ 


f \ 


f 










Read Alignment 

Trim adaptor and low quality end 


> 


f 


Quality Control 

Adjust amplification bias, end-repair bias, and short fragment bias 
Estimate bisulfite conversion rate 


> 


t 



Single Sample Analysis 

Call methylation ratio at each cytosine 
Report confidence interval 
Generate descriptive figures, tables, and browser visualization 
Detect CpGs with strand bias of methylation 
Detect hypo- or hyper- methylated regions 



Multiple Sample Analysis 

Identify de novo of differential methylated cytosines and regions 
Examine differential methylation levels of pre-defined regions 
Report correlation between multiple samples 



Scale 
chr11: 



10 kb |— 

69,390,000 
Wrp53 



H mm9 



F-tt 



Trp53 



i n m il in 



100- 



=3 m 
O c\i ■ 

o 

CD 8- 

Q. 

O m. 




0% meth: 78.8% 
1%-25% meth: 12.6% 
75%-99% meth: 3% 
100% meth: 2.6% 



CM (j 
CL 

,lo Q 



I 1 1 1 

20 40 60 80 100 

Methylation ratio (%) 
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control (QC), single sample analysis and multiple sample 
comparative analysis. 1) The read alignment model is a 
wrapper of popular bisulfite mapping programs, such as 
BSMAP [15], which allows the trimming of low quality 
band adaptor sequences, as well as supports parallel com- 
puting on a cluster. 2) The QC module adjusts biases in 
PCR amplification, end-repair, bisulfite conversion failure, 
and etc. [24]. In addition, it can also estimate bisulfite con- 
version rate based on cytosines in the non-CpG content. 3) 
Single sample analysis reports CpG or CpH methylation ra- 
tios with corresponding confidence intervals, detects hypo- 
or hyper- methylated regions (e.g. Trp53 gene in Figure 2b) 
in the genome [25], and provides general statistics with de- 
scriptive figures (an example of the mouse methylome [25] 
is shown in Figure 2c). 3) For multiple sample comparative 
analysis, MOABS detects de novo DMCs, which can be fur- 
ther grouped into DMRs using a Hidden Markov Model. 
MOABS can also examine the differential methylation 
levels of pre-defined regions, such as promoters. 

All the modules are wrapped in a single master script 
such that users can specify the input BS-seq reads and 
run all the modules one by one automatically. The 
MOABS pipeline is developed using C++ with highly ef- 
ficient numerical algorithms, native multiple-threading 
and cluster support so that multiple jobs can run in par- 
allel on different computing nodes. Several mathematical 
and computational optimizations have made MOABS 
pipeline extremely efficient. For example, it takes only 
one hour on 24 CPUs (IBM power7 4 Ghz) to detect dif- 
ferential methylation for approximately 30 million CpGs 
in the human genome based on 2 billion aligned reads. 
MOABS is significantly faster than other software. For 
example, a benchmark (Additional file 2: Table SI) based 
on public BS-seq data in mouse hematopoietic stem cell 
(HSC) [26] reveals that MOABS is roughly 3.3, 1.7, and 
1.4 times faster than BSmooth in bisulfite mapping, 
methylation call and differential methylation analysis, re- 
spectively. In summary, MOABS is a comprehensive, ac- 
curate, efficient and user-friendly solution for analyzing 
large-scale BS-seq data. 

Simulated BS-seq data reveals the superior performance 
of MOABS 

To assess the performance of MOABS on differentially 
methylated CpGs (DMCs), we simulated 0.1 million true 
positive CpGs with large methylation difference and 0.9 
million true negative CpGs (Additional file 3: Figure SI) 
from a HI methylome [16], and then compared MOABS 
with FETP at 5% false discovery rate (FDR) (Figure 3). 
Note that this evaluation is at single CpG resolution 
without local smoothing, therefore BSmooth [13] cannot 
be used. The results indicate that MOABS clearly out- 
performs FETP with the most dramatic difference ob- 
served at low sequencing depth. For example, with 
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Figure 3 Comparison between MOABS and FETP in detecting 
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negative CpGs were initially assigned the same methylation ratios. The 
density of the methylation ratios fits a bimodal distribution (Additional file 
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one sample and high ratios [0.75,1] in the other sample, respectively. Each 
methylation ratio was then given a +/-0.05 fluctuation to simulate BS-seq 
errors. Sequencing depth is randomly sampled from 5-fold to 50-fold. The 
Y-axis shows the percentage of true DMCs predicted at 5% FDR. 



sequencing depth at 5-10 fold, MOABS can recover 55- 
75% true positives while FETP only predicts 13-51% true 
positives. To further evaluate the performance of 
MOABS at different methylation levels, we re-simulated 
the 0.1 million true positive CpGs with different baseline 
methylation levels (0% -100%) and methylation differ- 
ences (20% - 100%). The results (Additional file 3: Figure 
S2) indicate that MOABS is more accurate than FETP at 
any sequencing depth and at any methylation difference. 
Notably, the difference between the two methods is large 
when sequencing depth is low or when methylation dif- 
ference is moderate (50% ~ 70%). In contrast, the differ- 
ence between methods is small when sequencing depth 
is high or when the methylation difference is either very 
high (80% ~ 100%) or very low (-20%). Although FETP 
is well suited for the analysis of discrete data, it has less 
power for DNA methylation, which by its nature is a 
continuous rather than discrete random variable. The 
improved power of MOABS results from the modeling 
of DNA methylation using a Beta-Binomial hierarchical 
model and the Empirical Bayes approach to borrow in- 
formation from all the CpGs in the genome. The testing 
data used for the method validation above is included in 
Additional file 4. 
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MOABS improves the detection of allele specific DNA 
methylation 

To assess how MOABS performs on DMRs for real BS- 
seq data, we compared MOABS with FETP and 
BSmooth [13] using allele-specific mouse methylomes 
[25], in which a list of well-known imprinted DMRs can 
serve as gold standard for method evaluation. Xie et al 
[25] used FETP followed by clustering of DMCs for 
DMR detection. They confirmed 32 known experimen- 
tally verified imprinted DMRs (Additional file 5: Table 

52) and reported 20 novel ones by pooling two biological 
replicates without considering sample variation. We no- 
ticed that two known DMRs (Ndn and Igf2r) are weak, 
exhibiting a very small methylation difference of ap- 
proximately 10%. We also found that 3 novel DMRs they 
reported (Vwde, Cascl and Nhlrcl) are differentially 
methylated in only one of the two replicates, and thus 
are likely to be false positives (Additional file 3: Figure 

53) . Since the remaining 17 novel DMRs have yet to be 
experimentally verified, we decided to remove them 
from our analysis. In our method evaluation, we used 
the 32 known DMRs as true positives and the remaining 
genome (with 17 reproducible novel DMRs removed) as 
true negatives. To allow for a fair comparison, we used 
the same method to calculate FDR for all three methods. 
In addition, we used the same procedure to cluster 
DMCs into DMRs for MOABS and FETP. The resulting 
ROC-like curves (Figure 4a) clearly indicate that 
MOABS is superior to the other two methods. MOABS 
successfully reports all 32 known DMRs including the 
two weak ones at 11% FDR, and 4 "false positive" new 
DMRs (Cdh20, Trappc9, Pcdhb20 and Pfdn4). Manual 
inspection (Additional file 3: Figure S4) confirms that 
these 4 "false positive" are indeed regions showing differ- 
ential methylation in both replicates. Hence the 11% 



FDR of MOABS is significantly over estimated based on 
incomplete true positives. Interestingly, our FETP ana- 
lysis predicts 7 new DMRs in addition to 32 known 
DMRs, suggesting additional filtering steps may have 
been performed in Xie et al. [25]. Among these 7 DMRs, 
one greatly overlaps with the new DMR Pcdhb20 re- 
ported by MOABS, while the other 6, including Vwde 
and Cascl and Nhlcl, show poor correlation between 
replicates. Finally, the ROC-like curve indicates that 
BSmooth is less accurate than either FETP or MOABS. 

The 32 known DMRs can be easily detected by both 
MOABS and FETP mainly because they have large 
methylation differences and high read depth (54-fold in 
DMR regions), which is consistent with our simulation 
study. However, deep bisulfite sequencing of the mam- 
malian genome is still quite expensive. This reality moti- 
vated us to test to what extent these known DMRs can 
still be recovered at a lower sequencing depth. The same 
previous procedure was applied to compare all three 
methods. The number of recovered known DMRs at 5% 
FDR is plotted at each sequencing depth from random 
sampling (Figure 4b). We observe that the lower sequen- 
cing depth, the greater performance difference between 
MOABS and FETP. For example, when the depth is at 
11-fold, MOABS recovers roughly 90% of known DMRs, 
while FETP only detects 78% of DMRs. When the depth 
is further lowed to 3.1 -fold, MOABS can still recover 
roughly 70% of known DMRs, while FETP detects 44% 
DMRs. Interestingly, BSmooths performance is largely 
independent of sequencing depth, probably because it 
was designed mainly for low sequencing depth. Indeed, 
at a low depth of 3.1 -fold, BSmooth outperforms FETP. 
However, at sequencing depth higher than 3.1 -fold, 
BSmooth has a lower sensitivity than the other two 
methods. Collectively, we conclude that MOABS is 
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Figure 4 MOABS improves the detection of allele specific DNA methylation. (a) The y-axis shows the number of known DMRs recovered by 
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superior in DMR detection, especially at low sequencing 
depth. 

MOABS reliably reveals differential methylation 
underlying TFBSs 

Since the previous benchmark is based on a small num- 
ber of experimentally verified DMRs, we sought to fur- 
ther evaluate the performance of MOABS based on 
larger scale datasets. The link between differential 
methylation and TFBSs provides such a good sys- 
tem. TFBSs are usually hypo-methylated compared to 
surrounding genome background; therefore, a tissue 
specific TFBS is expected to be a tissue specific hypo- 
methylated-DMR (hypo-DMR). To test this hypothesis, 
we performed deep (46-fold) WGBS of the mouse 
hematopoietic stem cell (HSC), and compared the HSC 
methylome with that of a publically available mouse em- 
bryonic stem cell (ESC) [27]. The HSC methylome data 
is accessible at NCBI GEO Accession GSE47815. The 
HSC-specific hypo-DMR were then compared with ap- 
proximately 58,000 in vivo ChlP-seq TFBSs of 10 major 
HSC specific TFs [28], including Erg, Flil, Gata2, Gfilb, 
Lmo2, Lyll, Meisl, Pu.l, Runxl and Scl. Figure 5a illus- 
trates the hypo-methylation of a TFBS in Runx2 gene. At 
the center of the TFBS co-bound by Runxl, Gata2 and 
Scl, there are 2 CpGs fully methylated in mouse ESC but 
unmethylated in HSC, while the surrounding regions 
are almost fully methylated in both HSC and ESC. 
Additional file 3: Figure S5 shows more examples of tis- 
sue specific hypo-DMR coupled with tissue specific 
TFBSs. Such TFBS associated hypo-methylated regions are 
usually very small and abundant in the genome. Using 
Runxl as an example, 71% of the 4793 Runxl TFBSs show 
hypo-methylation, while the remaining TFBSs are either 
fully methylated or have no underlying CpGs. Toge- 
ther, -34% of TFBS associated hypo-methylated regions 
contain no more than 3 CpGs with a median length of 
51 bp (Figure 5b). Furthermore, 14% of such regions even 
have a single CpG. For such small DMRs, single CpG level 
differential analysis is essential since regional averaging is 
very likely to overlook most of them. 

We then used TFBSs to evaluate DMC detection as- 
suming tissue-specific TF binding is associated with 
tissue-specific hypo-methylation. For a fair comparison, 
we calculated FDR for each method based on a method- 
specific null distribution obtained through permutation 
of read sample labels. At FDR of 5%, MOABS, FETP and 
BSmooth predicted 32,867, 32,047 and 18,021 differen- 
tially methylated TFBSs respectively (Figure 5c). We also 
used a method similar to Gene Set Enrichment Analysis 
(GSEA [29]) to test enrichment of TFBS moving down 
the lists of DMCs ranked by different methods. MOABS 
shows the highest enrichment score (Figure 5d) of TFBS. 
For example, with the same 4,000 most significant 



DMCs, MOABS recovers 1,000 TFBSs while FETP only 
predicts -600 TFBSs (i.e., 40% less). 

In this instance, the sequencing depth is sufficient to en- 
able MOABS and FETP to recover very similar number of 
TFBSs. However, when we randomly sampled reads to a 
depth of 4-fold, MOABS recovered many more TFBS 
(15,349) than FETP (7,520) and BSmooth (4,028) (Figure 5e). 
Again, at this low sequencing depth, MOABS not only re- 
covers 2-3 fold more TFBSs, but also exhibit more signifi- 
cant score of TFBS enrichment in the most significant 
DMCs. In both high and low sequencing depths, BSmooth 
recovers fewer TFBSs mainly because its smoothing func- 
tion easily ignores small region with a few CpGs. Together, 
using tissue specific in vivo TFBSs, we demonstrate that 
MOABS can better recover differential methylation in small 
regulatory regions with a few CpGs, especially at low se- 
quencing depth (e.g. 4-fold). 

MOABS detects differential 5hmc in ES cells using RRBS 
and oxBS-Seq 

To demonstrate the broad utility of MOABS, we ana- 
lyzed 5hmc data using RRBS and oxBS-seq [9]. RRBS 
measures both 5mc and 5hmc together while oxBS-Seq 
[9] detects 5mc directly. The 5hmc level can then be in- 
ferred by the difference between RRBS and oxBS-Seq of 
the same sample. The 5hmc level is often very small (e.g. 
at 5%) and hence its detection requires hundreds of fold 
coverage using FETP [9]. Our simulation study indicates 
that MOABS can significantly reduce the depth require- 
ment (Figure 6a). For example, to detect 5hmc at 5% 
when 5mc is at 0%, MOABS requires 80-fold coverage 
while FETP needs -200-fold. However, when the 5mc 
level is close to 50%, significantly more reads will be 
needed for both methods (~ 120-fold for MOABS and 
> 500-fold for FETP). The differential 5hmc between two 
samples can be inferred by the difference of two CDIF 
values, each of which is the difference between RRBS 
and oxBS-Seq of the same sample. The similar numerical 
approach can then be used to infer the distribution of 
the difference of the difference between two Beta distri- 
butions, which are used to model BS-seq data. Figure 6b 
shows an example, in which 5hmc is measured by both 
RRBS and oxBS-Seq in two samples. FETP shows more 
significant p-value for 5hmc in sample #1 than in #2, 
whereas MOABS CDIF is bigger in sample #2 than in 
#1. However, the significance of FETP on sample #1 is 
largely driven by the high sequencing depth, thus does 
not correctly represent the actual biological difference. 
In contrast, MOABS CDIF reaches a balance between 
statistical and biological significance and gives a bio- 
logically meaningful differential 5hmc at CDIF value of 
0.06 (0.29-0.23). 

When applied to RRBS and oxBS-seq data derived 
from ES cell lines with different passages [9], MOABS 
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Figure 5 MOABS reveals differential methylation underlying TFBSs. (a) UCSC genome browser illustration of one TF binding site. The tracks 
from top to bottom are genomic positions, RefSeq Gene, HSC Methylation, ESC Methylation, and TFBS. For each CpG, an upward bar denotes the 
methylation ratio, (b) Distribution of the number of DMCs underlying TFBSs. The inserted boxplot indicates the length distribution of TFBSs with 
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and BSMOOTH respectively. 



reported 299 genes with decreased 5hmc and 125 genes 
with increased 5hmc (Additional file 6: Table S3) in pro- 
moters in the later passage P20, which is consistent with 
the mass spectrometry data [9] that shows overall re- 
duced 5hmc in later passage. This result implies that the 
epigenetic stability of ES cells is impacted by prolonged 
in vitro culture. This is an important issue for both the 



safety and efficacy of stem cell-derived tissues in cell- 
replacement therapies as well as the appropriate in- 
terpretation of experimental models. Mono-allelic gene 
expression, including genomic imprinting, is primarily 
regulated through epigenetic mechanisms and thus can 
serve as a useful model of epigenetic stability. As ex- 
pected, our analysis identified five imprinted genes with 
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decreased 5hmc: Plagll, Sfmbt2, Gprl, Kcnql and 
Kcnqlotl, as well as one imprinted gene with increased 
5hmc, Pcdha4-g. 

The role of 5hmc in disease remains unclear. A recent 
study suggests that genome-wide loss of 5hmc is an epi- 
genetic feature of neurodegenerative Huntington's disease 
[30]. The authors identified 559 genes with decreased 
5hmc in the diseased mice compared to healthy controls. 
A considerable fraction of these disease-specific genes were 
uncovered in our differential 5hmc analysis in ES cells. 
This included 26 of 299 and 11 of 125 genes (overlapping 
p-value < 8e-5) with decreased and increased 5hmc, re- 
spectively. These results suggest that one potential conse- 
quence of decreased epigenetic stability over time in ES 
cells is the acquisition of pathological epimutations. 

The observed bias toward loss of 5hmc in ES cells upon 
long-term culture may also suggest stem cell properties, such 
as pluripotency, are affected. Ficz and colleagues [31] showed 
that knockdown of Tetl/Tet2 in mouse ES cells down regu- 
lates epigenetic reprogramming and pluripotency-related 
genes such as Esrrb, Klf2, Tell, Zfp42, Dppa3, Ecatl and 
Prdml4. Decreased expression was concomitant with both 
decreased 5hmC and increased 5mC at the gene promoters. 
In our differential 5hmc analysis in ES cells, we observed de- 
creased 5hmc at three of these genes: Ecatl, Esrrb, and 
Zfp42. Together, we conclude that MOABS can be used ef- 
fectively to infer differential 5hmc using RRBS and oxBS- 
Seq. 

Conclusions 

While progress in next-generation sequencing allows in- 
creasingly affordable BS-seq experiments, the resulting 
data generated poses significant and unique bioinformat- 
ics challenges. The lack of efficient computational 
methods is the major bottleneck that prevents a broad 



adoption of such powerful technologies. In response to 
this challenge, we developed MAOBS, an accurate, com- 
prehensive, efficient, and user- friendly pipeline for BS- 
seq data analysis. The MOABS analysis is novel and sig- 
nificant in two major aspects: 1) MOABS CDIF value 
provides an innovative strategy to combine statistical p- 
value and biological difference into a single metric, 
which will bring biological relevance to the interpret- 
ation of the DNA methylation data. 2) MOABS does not 
sacrifice resolution with low sequencing depth. By rely- 
ing on the Beta-Binomial Hierarchical Model and Empir- 
ical Bayes approach, MOABS has enough power to 
detect single-CpG-resolution differential methylation in 
low-CpG -density regulatory regions, such as TFBSs, with 
as low as 10-fold. The low-depth BS-seq experimental de- 
sign enables remarkable cost reduction per sample. In 
Figure 3 simulated data, we showed that MOABS achieved 
roughly 80% sensitivity with 5% FDR at 10-fold sequen- 
cing depth. In Figure 4b real data, we showed that as se- 
quencing depth decreased to 11 -fold by sampling, 
MOABS recovered roughly 90% of known DMRs. The 
MOABS sensitivity starts to drop dramatically when se- 
quencing depth is further reduced. Based on the above 
two observations, we would recommend low-depth (e.g. 
10-fold) BS-seq on more biological samples with the same 
limited budget, which in most scenarios will provide 
greater biological insights than high-depth BS-seq on 
fewer samples. 

Copy Number Variation (CNV) is a common issue in 
many disease related bisulfite sequencing. The sequen- 
cing depth is normally higher or lower in high (or low) 
copy-number regions and this depth bias has an impact 
on our CDIF calculation. To correct this bias, we have 
included a separate script redepth.pr in the MOABS 
package. Users can select their favorite CNV detection 
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tools [32], such as CNV-Seq, Control-FREEC and VarS- 
can, to predict the CNV region from genome sequencing 
or bisulfite sequencing. Nearly all these tools output a 
bed file of CNV regions with predicted copy number 
based on a p-value cutoff. The script redepth.pl' manip- 
ulates the read alignment BAM files according to the 
CNV prediction. If a read is located in a CNV region 
with a predicted copy number of X in a diploid genome, 
the read will have a probability of 2/X to be kept in the 
new BAM files. Reads in the non-CNV regions will keep 
unchanged. This process will result in CNV bias free 
BAM files for downstream analysis. 

Large-scale case-control epigenome-wide association 
study (EWAS) is a powerful strategy to identify disease- 
associated epigenetic biomarkers. Currently, most stud- 
ies use Illumina bisulfite arrays (e.g. 450 K) mainly due 
to the cost constraint. MOABS in theory can also be ap- 
plied to such studies when EWAS bisulfite sequencing 
data are publicly available. 

In summary, as DNA methylation is increasingly rec- 
ognized as a key regulator of genomic function, deci- 
phering its genome-wide distribution using BS-seq in 
numerous samples and conditions will continue to be a 
major research interest. MOABS significantly increase 
the speed, accuracy, statistical power and biological rele- 
vance of the BS-seq data analysis. We believe that 
MOABS s superior performance will greatly facilitate the 
study of epigenetic regulation in numerous biological 
systems and disease models. 

Materials and methods 

The major portions of the methods for the model are 
described here. In the Additional file 7, we provide more 
details and additional methods to make the model 
complete. 

Distribution for difference of two Binomial proportions 

In the Additional method section (Additional file 7) we 
show that a methylation ratio p inferred from k methyl- 
ated cytosines out of n total reads, follows a Beta distri- 
bution from the Bayesian perspective. The probability 
density function is 

/(p; n, k) = Be(a, p) = -f 1 P) , (1) 

/ p«-\l-pf- l dp 
Jo 

where a = k + a 0 , |3 = n-k + (3 0 , if Be(a 0 , |3 0 ) is priori distri- 
bution for p. We also give formulas to numerically cal- 
culate the confidence interval for the single Binomial 
proportional p under observed (n, /<). 

The methylation ratio difference at a defined genomic 
locus from two biological samples is the difference of 
two Binomial proportions p\-p 2 . Many methods have 



been proposed to estimate the confidence interval p\-p 2 of 
and their merits have been subject to decades of consider- 
able debate [22,33-38]. No comprehensive comparison of 
currently available methods is available. This motivated us 
to turn to the direct and exact numerical calculation of 
confidence interval from Bayesian perspective. 

Let t = pi~p 2 , where p t is the proportion for the sample 
i with observation rit and k t . Since the joint probability 
density of such observation is f(p l7 n x , ki) f(p 2 , n 2 , k 2 ), 
the PDF for t is 

f(t)= f Wi(P 2 + *)/ 2 (P 2 ) 
Jo 

= f dpj^f^-t), (2) 
Jo 

where fiipi) =J[pf> n b ki). Boundary conditions like the 
proportional area condition, minimal length condition 
can be applied to get unique solutions for (a, b). 

Distribution for difference of difference 

Let t = pi - p 2) where p t is the proportion for the assay i 
with observation n t and k/. In the ox-BS experiments, p 2 
is the oxBS methylation ratio and p 1 is the RRBS methy- 
lation ratio, and t is the 5hmc methylation ratio. Since 
the joint probability density of such observation isjlpi; 
ni> h)fiP2> n 2 \ k 2 ), the PDF for t is 

f(t)= [ Y d Pl f x (p 2 + t)fM 
Jo 

= I dpj&tftfa-t), (3) 
Jo 

where /fa) =f[pi, n b k t ). 

Let t = p l -p 2 > where ' denotes the other sample. To 
be clear, call the two samples S and S'. In general we 
want to know the difference of the two 5hmc ratios, i.e., 
t-t'. Let x = t-t', we can immediately obtain the distribu- 
tion of difference of 5hmc ratio between two samples by 

/(*) = jj{t)f{t-x)dt = Jj{t' + x)f(t')dt', (4) 

where and/(£') are the distributions of 5hmc ratio 
for sample S and S' respectively. After distribution of 
difference of 5hmc ratio between two samples is ob- 
tained, similarly confidence interval, credible difference 
and similarity test p-value can be calculated. 

Distribution for measurements with replicates 

Here we use the exact numerical approach to calculate 
the distribution of p at observance (m if l t ) of with m,- as 
total count for replicate i and l t as methylated count for 
replicate L Let us start with 2 replicates. We try to fit 
this unknown distribution of p at observance (m lt li) 
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and (m 2 , h) into a Beta distribution J[p; a,|3). The param- 
eter estimation is based on the following formula 

P{k i ]n il a 1 ft) = / f{k i -n il p)f{p]a 1 ft)dp 1 (5) 
Jo 

where P{k b n b a } ft) is the probability to observe (n b Iq) 
under the Beta distribution f[p; a, (3), and fik^ n b p) is 
the Binomial distribution, i.e., the probability to observe 
(n b kj) under a specific true ratio p. For N number of 
replicates, (a, (3) may be estimated by maximizing the 
log likelihood function 

lo g L(a^) = glo g ^; — j, (6) 

where the expression inside log is the probability P{k b n b 
a, ft) defined in equation (5) and B (a, |3)is the Beta 
function. 
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